Data Visualization of Lunch Form¶

Goal of the notebook: I use Bokeh to visualize data I have collected. The data is a CSV file of questions generated from a lunch form, together with the answers that several question-answering models gave to those questions.

Side note: Later in the notebook I switch from Bokeh to Plotly. I switched because Plotly is easier to use, and it is also a tool that is new to me.

1. Importing Libraries¶

In [223]:
import pandas as pd
import os  
import numpy as np

# pi is used to convert shares into pie chart angles
from math import pi

from bokeh.palettes import Category20c
from bokeh.transform import cumsum
from bokeh.plotting import figure, output_notebook, show

from squarify import normalize_sizes, squarify

from bokeh.sampledata.sample_superstore import data
from bokeh.transform import factor_cmap

import plotly.express as px
import plotly

2. Importing the Merged Data Frame¶

The data comes from CSV files built from the generated questions and the answers that different question-answering models gave to them. The merged data frame combines the answers each model predicted for every label of the form.

In [2]:
df = pd.read_csv(r'C:\Users\victo\source\repos\Semester 7\JupyterLab\Group\Question Generator\csv_ouput\df_merged.csv', index_col=[0])
# If the index was saved as an 'Unnamed: 0' column, it can be dropped by name:
# df.drop('Unnamed: 0', axis=1, inplace=True)
df.head()
Out[2]:
label questions answer score model percentage actual_answer
0 Number of Attendees what Number of Attendees? 15 62.77 model 1 1.539416 15
1 Number of Attendees who Number of Attendees? 15 67.06 model 1 1.644627 15
2 Number of Attendees where Number of Attendees? 15 46.99 model 1 1.152416 15
3 Number of Attendees when Number of Attendees? 15 55.51 model 1 1.361367 15
4 Number of Attendees why Number of Attendees? 15 55.14 model 1 1.352293 15

3. Exploratory Data Analysis using Bokeh¶

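In this chapter I use Bokeh to draw a pie chart of the form labels and a treemap of the correct predictions per model.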

3.1. Pie Charts¶

I will visually represent the different labels that are used in the dataset.

In [3]:
x = df.label.value_counts()
data = pd.Series(x).reset_index(name='value').rename(columns={'index': 'country'})
data['angle'] = data['value']/data['value'].sum() * 2*pi
data['color'] = Category20c[len(x)]
data
Out[3]:
country value angle color
0 Number of Attendees 60 0.628319 #3182bd
1 Budget 60 0.628319 #6baed6
2 Organizer 60 0.628319 #9ecae1
3 Contact Details 60 0.628319 #c6dbef
4 Date 60 0.628319 #e6550d
5 End Time 60 0.628319 #fd8d3c
6 Start Time 60 0.628319 #fdae6b
7 Food Allergies 60 0.628319 #fdd0a2
8 Food Diets 60 0.628319 #31a354
9 Location 60 0.628319 #74c476
In [5]:
p = figure(height=350, title="Pie Chart", toolbar_location=None,
           tools="hover", tooltips="@country: @value", x_range=(-0.5, 1.0))

p.wedge(x=0, y=1, radius=0.4,
        start_angle=cumsum('angle', include_zero=True), end_angle=cumsum('angle'),
        line_color="white", fill_color='color', legend_field='country', source=data)

p.axis.axis_label = None
p.axis.visible = False
p.grid.grid_line_color = None
output_notebook()
show(p)

Some background on the data: it consists of questions generated for a lunch form, grouped by form label. The pie chart above shows the ten labels in the dataset, each with 60 rows. The labels are examined more closely later in the notebook.
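
The `angle` column computed above is each label's share of the rows scaled to a full circle of 2π radians, and the wedges are laid end to end by accumulating those angles. Below is a minimal pure-Python sketch of that arithmetic, assuming the counts shown in Out[3] (ten labels, 60 rows each). `wedge_angles` is a hypothetical helper of mine, not part of Bokeh.

```python
from itertools import accumulate
from math import pi, isclose

def wedge_angles(counts):
    """Return (start, end) angles in radians for each category's wedge."""
    total = sum(counts.values())
    # Each category's share of the full circle
    angles = [value / total * 2 * pi for value in counts.values()]
    # Running totals give the end angles; start angles are the same series shifted by one
    ends = list(accumulate(angles))
    starts = [0.0] + ends[:-1]
    return dict(zip(counts, zip(starts, ends)))

# Counts as displayed in Out[3]: ten labels with 60 rows each
labels = ["Number of Attendees", "Budget", "Organizer", "Contact Details", "Date",
          "End Time", "Start Time", "Food Allergies", "Food Diets", "Location"]
counts = {label: 60 for label in labels}

wedges = wedge_angles(counts)
start, end = wedges["Budget"]
print(round(end - start, 6))  # 0.628319, the value in the angle column
```

Because every label has the same count, all ten wedges get the same angle, which is why the pie chart is split into equal slices.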

3.2. Hierarchical data¶

Treemaps¶

In [6]:
def treemap(df, col, x, y, dx, dy, *, N=100):
    sub_df = df.nlargest(N, col)
    normed = normalize_sizes(sub_df[col], dx, dy)
    blocks = squarify(normed, x, y, dx, dy)
    blocks_df = pd.DataFrame.from_dict(blocks).set_index(sub_df.index)
    return sub_df.join(blocks_df, how='left').reset_index()
In [72]:
df_copy = df.copy()
df_copy.head()
Out[72]:
label questions answer score model percentage actual_answer comparing_answers
0 Number of Attendees what Number of Attendees? 15 62.77 model 1 1.539416 15 True
1 Number of Attendees who Number of Attendees? 15 67.06 model 1 1.644627 15 True
2 Number of Attendees where Number of Attendees? 15 46.99 model 1 1.152416 15 True
3 Number of Attendees when Number of Attendees? 15 55.51 model 1 1.361367 15 True
4 Number of Attendees why Number of Attendees? 15 55.14 model 1 1.352293 15 True
In [9]:
df['comparing_answers'] = df.apply(lambda row: all(i in row.answer for i in row.actual_answer), axis=1)
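
Note that this check is character-based: it is true whenever every character of `actual_answer` appears somewhere in `answer`, so a prediction of `51` would also count as matching `15`. If a stricter test is wanted, one option is a normalized substring comparison. This is an assumption on my part and not what the results in this notebook were computed with; `loose_match` and `strict_match` are hypothetical helper names.

```python
def loose_match(answer: str, actual_answer: str) -> bool:
    """The notebook's check: every character of the expected answer occurs in the prediction."""
    return all(ch in answer for ch in actual_answer)

def strict_match(answer: str, actual_answer: str) -> bool:
    """Stricter alternative: the expected answer occurs as a contiguous, case-insensitive substring."""
    return actual_answer.strip().lower() in answer.strip().lower()

print(loose_match("15", "15"), strict_match("15", "15"))  # True True
print(loose_match("51", "15"), strict_match("51", "15"))  # True False
```

Switching to the stricter check could lower the counts of correct predictions shown in the rest of the notebook.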
In [10]:
df.head()
Out[10]:
label questions answer score model percentage actual_answer comparing_answers
0 Number of Attendees what Number of Attendees? 15 62.77 model 1 1.539416 15 True
1 Number of Attendees who Number of Attendees? 15 67.06 model 1 1.644627 15 True
2 Number of Attendees where Number of Attendees? 15 46.99 model 1 1.152416 15 True
3 Number of Attendees when Number of Attendees? 15 55.51 model 1 1.361367 15 True
4 Number of Attendees why Number of Attendees? 15 55.14 model 1 1.352293 15 True
In [11]:
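# Keep only the correctly predicted rows; .copy() allows adding columns later without a pandas warning
df_correct_prediction = df[df.comparing_answers].copy()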
df_correct_prediction.head()
Out[11]:
label questions answer score model percentage actual_answer comparing_answers
0 Number of Attendees what Number of Attendees? 15 62.77 model 1 1.539416 15 True
1 Number of Attendees who Number of Attendees? 15 67.06 model 1 1.644627 15 True
2 Number of Attendees where Number of Attendees? 15 46.99 model 1 1.152416 15 True
3 Number of Attendees when Number of Attendees? 15 55.51 model 1 1.361367 15 True
4 Number of Attendees why Number of Attendees? 15 55.14 model 1 1.352293 15 True
In [12]:
models = sorted(df['model'].unique())

print(models)
['model 1', 'model 2', 'model 3', 'model 4', 'model 5', 'model 6']
In [73]:
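# Re-import the Superstore sample, because the pie chart cell reused the name `data`
from bokeh.sampledata.sample_superstore import data

data = data[["City", "Region", "Sales"]]

regions = ("West", "Central", "South", "East")

# Total sales per city, then per region, sorted from smallest to largest
sales_by_city = data.groupby(["Region", "City"]).sum(numeric_only=True)
sales_by_city = sales_by_city.sort_values(by="Sales").reset_index()

sales_by_region = sales_by_city.groupby("Region").sum(numeric_only=True).sort_values(by="Sales")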
In [74]:
data.shape
Out[74]:
(9994, 3)
In [84]:
sales_by_region
Out[84]:
Sales
Region
South 391721.9050
Central 501239.8908
East 678781.2400
West 725457.8245
In [90]:
# Summing the boolean column counts the correct predictions per model and label
score_by_label = df_correct_prediction.groupby(["model", "label"]).sum(numeric_only=True)
score_by_label = score_by_label.sort_values(by="comparing_answers").reset_index()

score_by_model = score_by_label.groupby("model").sum(numeric_only=True).sort_values(by="comparing_answers")
score_by_model
Out[90]:
score percentage comparing_answers
model
model 3 2557.03 146.987787 39
model 2 2195.60 90.754293 41
model 6 1213.33 43.869068 47
model 1 1035.87 40.455861 49
model 5 1942.50 105.797038 57
model 4 2117.84 102.528483 62
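
Summing works as a count here because `comparing_answers` is boolean and every `True` adds 1. Below is a small sketch of the same two-step aggregation on made-up rows. The values are invented for illustration and are not taken from the real data.

```python
import pandas as pd

# Toy rows with the same column names as the notebook; the values are invented
toy = pd.DataFrame({
    "model": ["model 1", "model 1", "model 2", "model 2", "model 2"],
    "label": ["Date", "Date", "Date", "Budget", "Budget"],
    "score": [50.0, 40.0, 30.0, 20.0, 10.0],
    "comparing_answers": [True, True, True, True, True],
})

# numeric_only=True keeps numeric and boolean columns and drops text columns
per_label = toy.groupby(["model", "label"]).sum(numeric_only=True).reset_index()
per_model = per_label.groupby("model").sum(numeric_only=True)

# Number of correct predictions per model
for model, n_correct in per_model["comparing_answers"].items():
    print(model, int(n_correct))
```

The `score` column is summed in the same pass, so in the table above it is a total over the correct predictions, not an average.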
In [92]:
x, y, w, h = 0, 0, 800, 450

blocks_by_model = treemap(score_by_model, "comparing_answers", x, y, w, h)
In [93]:
blocks_by_model
Out[93]:
model score percentage comparing_answers x y dx dy
0 model 4 2117.84 102.528483 62 0.000000 0.000000 322.711864 234.453782
1 model 5 1942.50 105.797038 57 0.000000 234.453782 322.711864 215.546218
2 model 1 1035.87 40.455861 49 322.711864 0.000000 260.338983 229.687500
3 model 6 1213.33 43.869068 47 322.711864 229.687500 260.338983 220.312500
4 model 2 2195.60 90.754293 41 583.050847 0.000000 216.949153 230.625000
5 model 3 2557.03 146.987787 39 583.050847 230.625000 216.949153 219.375000
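
The block sizes in this table follow a simple rule: each model's rectangle gets a share of the 800 × 450 canvas proportional to its number of correct predictions. The sketch below reproduces only that normalization step in plain Python, on the assumption that this is what `normalize_sizes` computes. The placement of the rectangles is done by `squarify` and is not reimplemented here. `normalize_areas` is a hypothetical name.

```python
def normalize_areas(sizes, width, height):
    """Scale sizes so that they sum to the canvas area (width * height)."""
    sizes = list(sizes)
    total = sum(sizes)
    canvas_area = width * height
    return [size * canvas_area / total for size in sizes]

# Correct predictions per model, largest first, as shown in the table above
correct = {"model 4": 62, "model 5": 57, "model 1": 49,
           "model 6": 47, "model 2": 41, "model 3": 39}

areas = dict(zip(correct, normalize_areas(correct.values(), 800, 450)))

# dx * dy of the "model 4" block, taken from the table above
table_area = 322.711864 * 234.453782
print(round(areas["model 4"], 1), round(table_area, 1))
```

The two printed numbers should agree up to the rounding of the table values, which is a quick way to check that the treemap areas really encode the counts.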
In [86]:
blocks_by_region = treemap(sales_by_region, "Sales", x, y, w, h)

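dfs = []
# Use separate loop names so the main `df` and the canvas origin x, y are not overwritten
for index, (region, sales, bx, by, bdx, bdy) in blocks_by_region.iterrows():
    city_df = sales_by_city[sales_by_city.Region == region]
    dfs.append(treemap(city_df, "Sales", bx, by, bdx, bdy, N=10))
blocks = pd.concat(dfs)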
In [87]:
blocks_by_region
Out[87]:
Region Sales x y dx dy
0 West 725457.8245 343.703270 0.000000 252.640624 450.000000
1 East 678781.2400 596.343894 0.000000 236.385508 450.000000
2 Central 501239.8908 832.729402 0.000000 310.973868 252.595298
3 South 391721.9050 832.729402 252.595298 310.973868 197.404702
In [78]:
x, y, w, h = 0, 0, 800, 450
In [96]:
dfs = []
# Split each model's rectangle among the labels it predicted correctly
for index, (model, score, percentage, comparing_answers, x, y, dx, dy) in blocks_by_model.iterrows():
    df_score = score_by_label[score_by_label.model == model]
    dfs.append(treemap(df_score, "comparing_answers", x, y, dx, dy, N=10))
blocks = pd.concat(dfs)
In [100]:
blocks
Out[100]:
index model label score percentage comparing_answers x y dx dy ytop
0 35 model 4 Date 505.25 16.538516 10 0.000000 0.000000 104.100601 117.226891 117.226891
1 36 model 4 Number of Attendees 522.17 12.806068 10 0.000000 117.226891 104.100601 117.226891 234.453782
2 37 model 4 End Time 392.80 27.215032 10 104.100601 0.000000 109.305631 111.644658 111.644658
3 44 model 4 Start Time 446.76 30.519520 10 213.406233 0.000000 109.305631 111.644658 111.644658
4 23 model 4 Food Allergies 3.02 0.243340 7 104.100601 111.644658 69.558129 122.809124 234.453782
5 16 model 4 Budget 0.00 0.000000 5 173.658731 111.644658 89.431880 68.227291 179.871949
6 11 model 4 Organizer 222.36 12.361781 4 173.658731 179.871949 89.431880 54.581833 234.453782
7 13 model 4 Contact Details 25.45 2.841099 4 263.090611 111.644658 59.621254 81.872749 193.517407
8 5 model 4 Food Diets 0.03 0.003127 2 263.090611 193.517407 59.621254 40.936375 234.453782
0 32 model 5 Location 512.78 57.877806 10 0.000000 234.453782 107.570621 113.445378 347.899160
1 29 model 5 Number of Attendees 704.32 17.273245 9 0.000000 347.899160 107.570621 102.100840 450.000000
2 30 model 5 Date 495.57 16.221657 9 107.570621 234.453782 113.898305 96.428571 330.882353
3 27 model 5 Start Time 41.37 2.826109 8 221.468927 234.453782 101.242938 96.428571 330.882353
4 17 model 5 End Time 0.00 0.000000 5 107.570621 330.882353 102.448211 59.558824 390.441176
5 18 model 5 Food Diets 23.05 2.402243 5 107.570621 390.441176 102.448211 59.558824 450.000000
6 12 model 5 Budget 0.00 0.000000 4 210.018832 330.882353 64.396018 75.802139 406.684492
7 8 model 5 Organizer 165.40 9.195172 3 274.414851 330.882353 48.297014 75.802139 406.684492
8 6 model 5 Food Allergies 0.01 0.000806 2 210.018832 406.684492 56.346516 43.315508 450.000000
9 7 model 5 Contact Details 0.00 0.000000 2 266.365348 406.684492 56.346516 43.315508 450.000000
0 41 model 1 Number of Attendees 603.97 14.812190 10 322.711864 0.000000 106.260809 114.843750 114.843750
1 42 model 1 End Time 149.97 10.390627 10 322.711864 114.843750 106.260809 114.843750 229.687500
2 43 model 1 Date 198.11 6.484800 10 428.972674 0.000000 154.078174 79.202586 79.202586
3 24 model 1 Start Time 2.25 0.153704 7 428.972674 79.202586 113.531286 75.242457 154.445043
4 25 model 1 Contact Details 72.74 8.120297 7 428.972674 154.445043 113.531286 75.242457 229.687500
5 9 model 1 Organizer 8.75 0.486444 3 542.503960 79.202586 40.546888 90.290948 169.493534
6 1 model 1 Budget 0.08 0.007798 2 542.503960 169.493534 40.546888 60.193966 229.687500
0 31 model 6 Date 257.43 8.426541 10 322.711864 229.687500 110.782546 110.156250 339.843750
1 33 model 6 End Time 90.44 6.266109 10 322.711864 339.843750 110.782546 110.156250 450.000000
2 34 model 6 Number of Attendees 685.84 16.820028 10 433.494410 229.687500 149.556437 81.597222 311.284722
3 45 model 6 Start Time 177.57 12.130341 10 433.494410 311.284722 87.974375 138.715278 450.000000
4 10 model 6 Organizer 0.05 0.002780 3 521.468785 311.284722 61.582062 59.449405 370.734127
5 3 model 6 Budget 0.00 0.000000 2 521.468785 370.734127 61.582062 39.632937 410.367063
6 4 model 6 Contact Details 2.00 0.223269 2 521.468785 410.367063 61.582062 39.632937 450.000000
0 39 model 2 Number of Attendees 755.63 18.531607 10 583.050847 0.000000 108.474576 112.500000 112.500000
1 40 model 2 Date 928.42 30.390280 10 691.525424 0.000000 108.474576 112.500000 112.500000
2 22 model 2 End Time 226.30 15.679129 7 583.050847 112.500000 72.316384 118.125000 230.625000
3 19 model 2 Contact Details 79.93 8.922950 5 655.367232 112.500000 103.309120 59.062500 171.562500
4 20 model 2 Budget 109.86 10.709168 5 655.367232 171.562500 103.309120 59.062500 230.625000
5 14 model 2 Start Time 95.46 6.521160 4 758.676352 112.500000 41.323648 118.125000 230.625000
0 38 model 3 Number of Attendees 788.75 19.343866 10 583.050847 230.625000 114.183764 106.875000 337.500000
1 28 model 3 Budget 679.74 66.261149 9 697.234612 230.625000 102.765388 106.875000 337.500000
2 26 model 3 Organizer 521.00 28.964237 7 583.050847 337.500000 75.932203 112.500000 450.000000
3 21 model 3 Date 379.74 12.430155 6 658.983051 337.500000 65.084746 112.500000 450.000000
4 15 model 3 Contact Details 164.96 18.415236 4 724.067797 337.500000 75.932203 64.285714 401.785714
5 2 model 3 Start Time 9.59 0.655122 2 724.067797 401.785714 50.621469 48.214286 450.000000
6 0 model 3 End Time 13.25 0.918022 1 774.689266 401.785714 25.310734 48.214286 450.000000
In [99]:
p = figure(width=w, height=h, tooltips="@label", toolbar_location=None,
           x_axis_location=None, y_axis_location=None)
p.x_range.range_padding = p.y_range.range_padding = 0
p.grid.grid_line_color = None

# Color each block by its model, with one palette entry per model
p.block('x', 'y', 'dx', 'dy', source=blocks, line_width=1, line_color="white",
        fill_alpha=0.8, fill_color=factor_cmap("model", Category20c[len(models)], models))

p.text('x', 'y', x_offset=2, text="model", source=blocks_by_model,
       text_font_size="18pt", text_color="white")

blocks["ytop"] = blocks.y + blocks.dy
p.text('x', 'ytop', x_offset=2, y_offset=2, text="label", source=blocks,
       text_font_size="6pt", text_baseline="top",
       text_color=factor_cmap("model", ("black", "white", "black", "white", "black", "white"), models))

show(p)
In [104]:
df_correct_prediction.head()
Out[104]:
label questions answer score model percentage actual_answer comparing_answers
0 Number of Attendees what Number of Attendees? 15 62.77 model 1 1.539416 15 True
1 Number of Attendees who Number of Attendees? 15 67.06 model 1 1.644627 15 True
2 Number of Attendees where Number of Attendees? 15 46.99 model 1 1.152416 15 True
3 Number of Attendees when Number of Attendees? 15 55.51 model 1 1.361367 15 True
4 Number of Attendees why Number of Attendees? 15 55.14 model 1 1.352293 15 True
In [113]:
df_correct_prediction['occurence'] = 1
df_correct_prediction.head()
Out[113]:
label questions answer score model percentage actual_answer comparing_answers occurence
0 Number of Attendees what Number of Attendees? 15 62.77 model 1 1.539416 15 True 1
1 Number of Attendees who Number of Attendees? 15 67.06 model 1 1.644627 15 True 1
2 Number of Attendees where Number of Attendees? 15 46.99 model 1 1.152416 15 True 1
3 Number of Attendees when Number of Attendees? 15 55.51 model 1 1.361367 15 True 1
4 Number of Attendees why Number of Attendees? 15 55.14 model 1 1.352293 15 True 1

4. Exploratory Data Analysis using Plotly¶

After seeing how much harder it was to set up a treemap in Bokeh than in Plotly, I chose to do the rest of my visualizations in Plotly because it is easier to use.

4.1. Treemaps¶

4.1.1. Visualizing the Number of Occurrences¶

I visualize how often each model predicts the correct answer for each label, to see which model performs best overall.

In [170]:
fig = px.treemap(df_correct_prediction, path=[px.Constant("all"), 'label', 'model', 'actual_answer'], values='occurence', title="Number of correct predictions per label, by model")
fig.update_traces(root_color="lightgrey", marker=dict(cornerradius=5))
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()
In [171]:
fig = px.treemap(df_correct_prediction, path=[px.Constant("all"), 'model', 'label'], values='occurence', title="Number of correct predictions per model, by form label")
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()

4.1.2. Visualizing the Total Confidence Score¶

I visualize the total confidence score of each model's correct predictions for each label, to see which model performs best overall.

In [172]:
fig = px.treemap(df_correct_prediction, path=[px.Constant("all"), 'label', 'model', 'actual_answer'], values='score', title="Total confidence score per label, by model")
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()
In [173]:
fig = px.treemap(df_correct_prediction, path=[px.Constant("all"), 'model', 'label'], values='score', title="Total confidence score per model, by form label")
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()
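
Because these treemaps sum the confidence scores, a model with many correct predictions can take up a large area even when each individual prediction had a modest score. A rough way to separate the two effects is to divide the summed score by the number of correct predictions. The sketch below does this with the totals from the `score_by_model` table shown earlier (Out[90]). Treating the mean as a fairer comparison is my own assumption, not something the charts above show.

```python
# (summed score, number of correct predictions) per model, copied from Out[90]
totals = {
    "model 1": (1035.87, 49),
    "model 2": (2195.60, 41),
    "model 3": (2557.03, 39),
    "model 4": (2117.84, 62),
    "model 5": (1942.50, 57),
    "model 6": (1213.33, 47),
}

# Mean confidence score per correct prediction
mean_score = {model: score / count for model, (score, count) in totals.items()}

# Models ranked from highest to lowest mean score
for model, value in sorted(mean_score.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: {value:.2f}")
```

On these numbers model 3 has the highest mean score per correct prediction even though it has the fewest correct predictions, so the two views rank the models differently.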

4.2. Pie Charts¶

In [130]:
fig = px.pie(df_correct_prediction, values='occurence', names='label', color_discrete_sequence=px.colors.sequential.RdBu, title='Correct Predictions per Label')
fig.show()

4.3. Dot Plot¶

Dot plots (also known as Cleveland dot plots) are scatter plots with one categorical axis and one continuous axis. They can be used to show changes between two (or more) points in time or between two (or more) conditions. Compared to a bar chart, dot plots can be less cluttered and allow for an easier comparison between conditions.

In [200]:
fig = px.scatter(score_by_label.sort_values('model'), y="label", x="comparing_answers", color="model", symbol="model", title='Correct Predictions per Label and Model')
fig.update_traces(marker_size=10)
fig.show()

4.4. Horizontal Bar Charts¶

4.4.1. Visualizing the Number of Occurrences¶

I visualize how often each model predicts the correct answer for each label, to see which model performs best overall, this time using a horizontal bar chart.

In [199]:
fig = px.bar(score_by_label.sort_values('model'), x="comparing_answers", y="label", color='model', orientation='h',
             hover_data=["comparing_answers", "score"],
             height=400,
             title='Number of Correct Predictions per Label for Each Model')
fig.show()

4.4.2. Visualizing the Total Confidence Score¶

I visualize the total confidence score of each model's correct predictions for each label, to see which model performs best overall, this time using a horizontal bar chart.

In [198]:
fig = px.bar(score_by_label.sort_values('model'), x="score", y="label", color='model', orientation='h',
             hover_data=["comparing_answers", "score"],
             height=400,
             title='Total Confidence Score per Label for Each Model')
fig.show()

4.5. Sunburst Charts¶

Sunburst plots visualize hierarchical data spanning outwards radially from root to leaves. Similar to Icicle charts and Treemaps, the hierarchy is defined by labels (names for px.sunburst) and parents attributes. The root starts from the center and children are added to the outer rings.

In [201]:
fig = px.sunburst(score_by_label, path=['label', 'model'], values='comparing_answers')
fig.show()
In [208]:
fig = px.sunburst(score_by_label, path=['label', 'model'], values='score', width=1000, height=500, title='Total confidence score per label, by model')
fig.show()
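
With `path=['label', 'model']`, each row contributes to a leaf (a label and model pair) whose parent is the label, and every label's value is the sum of its leaves. The sketch below builds the `ids`, `parents` and `values` lists for such a two-level hierarchy by hand. It is a conceptual illustration with invented rows, not Plotly's actual implementation, and `build_hierarchy` is a hypothetical helper.

```python
from collections import defaultdict

def build_hierarchy(rows, path, value_key):
    """Aggregate rows into ids, parents and values for a hierarchy chart."""
    totals = defaultdict(float)
    parent_of = {}
    for row in rows:
        parent = ""  # top-level nodes have an empty parent
        for depth in range(len(path)):
            # A node id is the path from the top level down to this level
            node = "/".join(str(row[key]) for key in path[: depth + 1])
            totals[node] += row[value_key]
            parent_of[node] = parent
            parent = node
    ids = list(totals)
    return ids, [parent_of[i] for i in ids], [totals[i] for i in ids]

# Invented example rows using the notebook's column names
rows = [
    {"label": "Date", "model": "model 1", "comparing_answers": 10},
    {"label": "Date", "model": "model 2", "comparing_answers": 6},
    {"label": "Budget", "model": "model 2", "comparing_answers": 5},
]

ids, parents, values = build_hierarchy(rows, ["label", "model"], "comparing_answers")
for node, parent, value in zip(ids, parents, values):
    print(f"{node!r} parent={parent!r} value={value}")
```

The inner ring of the sunburst corresponds to the nodes with an empty parent, and the outer ring to their children.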

4.6. Icicle Charts¶

Icicle charts visualize hierarchical data using rectangular sectors that cascade from root to leaves in one of four directions: up, down, left, or right. Similar to Sunburst charts and Treemap charts, the hierarchy is defined by labels (names for px.icicle) and parents attributes. Click on one sector to zoom in or out, which also displays a pathbar at the top of the icicle. To zoom out, you can click the parent sector or the pathbar.

In [221]:
fig = px.icicle(score_by_label, path=[px.Constant("Lunch Form"), 'label', 'model'], values='comparing_answers')
fig.update_traces(root_color="lightgrey")
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()

4.7. Patterned Charts¶

In [218]:
fig = px.area(score_by_label, x="label", y="comparing_answers", color="model", pattern_shape="model")
fig.show()

fig.write_html(r"C:\Users\victo\source\repos\Semester 7\JupyterLab\Data Visualization\file.html")
In [224]:
plotly.offline.init_notebook_mode()

Conclusion¶

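To conclude, this notebook shows several ways to visualize the collected data: pie charts, treemaps, dot plots, bar charts, sunburst charts, icicle charts and patterned area charts. Counting correct predictions, model 4 comes out on top with 62 and model 3 is last with 39.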